add markdown formatter / exporter #1976

Open
wants to merge 10 commits into
base: main

mayel
Contributor

@mayel mayel commented Dec 8, 2024

lib/ex_doc/cli.ex Outdated
@josevalim
Member

Thank you @mayel, I think the general direction is good and I think we can continue exploring it. The testing structure will be very important too.

Just a heads up, we will be slow with reviews on our side, since we are focused on Elixir v1.18 and launching Livebook Teams.

@mjrusso

mjrusso commented Dec 26, 2024

This is awesome @mayel! And thanks for letting me know about this PR :)

Having spent some time thinking about this, a few requirements suggestions for discussion/debate. My ideal Markdown export would generate:

  • an exact mirror of existing HTML page structure, but in Markdown format (with .md extension for each file, and working hyperlinks between all .md files)
  • at least one "single file" export with all documentation included in a single Markdown file
    • (there might be opportunities to generate a few different types of "single file" exports; here's what hex2txt is currently doing, but it would probably make sense for exdoc to generate a version of the single-file export with all "extras" included)
  • a top-level Markdown file that simply links to every generated Markdown file (like a sitemap, i.e. the intention behind the llms.txt proposal)

I would also recommend generating all of this by default, so tooling can start to rely on these files existing :)

mayel added a commit to mayel/ex_doc that referenced this pull request Dec 26, 2024
@mayel
Contributor Author

mayel commented Dec 26, 2024

@mjrusso Thanks for the feedback!

The structure of the files and markdown contents should already match that of the html docs.

In terms of a single file, that was the intention of the ZIP archive containing all the md docs: it can easily be downloaded from hexdocs, and devs can choose which files/modules they want to add as context rather than always including everything. But now I'm thinking we could add a CLI flag to generate either a single file or a ZIP with separate files, and leave the discussion of what the default should be for later?...

I've pushed some WIP I hadn't staged which includes generating an index.md with structured links to all the .md docs.
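
For readers following along, here is a rough illustration of what a sitemap-style index.md with structured links could look like. This is only a sketch and not the code in this PR: the `IndexSketch` module, its `render/2` function, and the section and page names are all hypothetical, and it assumes each page lives next to the index as `<name>.md`.

```elixir
defmodule IndexSketch do
  # Hypothetical helper, not part of ExDoc.
  # Takes a title and a list of {section_name, [page_name]} tuples and
  # returns the Markdown text of an index that links to "<page_name>.md".
  def render(title, sections) do
    body =
      Enum.map_join(sections, "\n\n", fn {section, pages} ->
        links = Enum.map_join(pages, "\n", fn page -> "- [#{page}](#{page}.md)" end)
        "## #{section}\n\n#{links}"
      end)

    "# #{title}\n\n#{body}\n"
  end
end

# Example with made-up project data.
IO.puts(
  IndexSketch.render("my_app v0.1.0", [
    {"Modules", ["MyApp", "MyApp.Worker"]},
    {"Guides", ["readme"]}
  ])
)
```

Grouping links under one heading per section keeps the file easy to scan for people and simple to parse for tools.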

@mjrusso

mjrusso commented Dec 27, 2024

> The structure of the files and markdown contents should already match that of the html docs.

Perfect :) I was mostly trying to enumerate my ideal requirements independent of what was already written, just for ease of debate.

> add a CLI flag to generate either a single file or a ZIP with separate files, and leave the discussion of what the default should be for later?...

In general my preference would be to make Markdown generation (in whatever form we decide) the default, with no other configuration options (other than disabling it if you really don't want it) so tools can rely on a common approach.

On the topic of the zip, single file generation, etc.:

> In terms of a single file, that was the intention of the ZIP archive containing all the md docs: it can easily be downloaded from hexdocs, and devs can choose which files/modules they want to add as context rather than always including everything. But now I'm thinking we could add a CLI flag to generate either a single file or a ZIP with separate files, and leave the discussion of what the default should be for later?...

Since we can already download a tarball of all docs from hex.pm (the md files, if generated, would be included by default there as well, correct?), I think we can forego the zip archive. Easy enough to get the markdown files from there. (And would these be fetched by default with mix hex.docs?)

Thinking through this a bit more, I think we could forego the single-file generation (at least for now). Realistically for AI tooling integration that works we are going to need a server in between that can manage pulling the right chunks of documentation for any given task. The individual md files being produced here provide the right building blocks.

(Also, instead of "Download Markdown version", perhaps "View Markdown version", which just links to the index.md file.)

@mayel
Contributor Author

mayel commented Dec 27, 2024

Ah the downloadable docs from hex.pm had completely slipped me by! It may be useful to also include that link in the doc footers next to the ePub.

And yeah all makes sense to me, hoping I find some time to work on it a bit more soon :)

@mayel
Contributor Author

mayel commented Dec 27, 2024

> we can already download a tarball of all docs from hex.pm (the md files, if generated, would be included by default there as well, correct?)

The epub is included in that ZIP so I'm guessing yes

@mayel
Contributor Author

mayel commented Dec 27, 2024

Alright I've experimented with updating the footer so it includes:

  • Hex Package (if a hex package is set)
  • View Code:
    • Source Repo (if a source url is known)
    • Hex Preview (if a hex package is set)
  • View Markdown version (if markdown formatter enabled)
  • Download docs archive (from hex, should include html, markdown and epub)
  • Search HexDocs

Note: source repo, hex preview, and view markdown version all link to the file matching the current module/page.

Screenshot 2024-12-27 at 15 11 32

@mayel mayel marked this pull request as ready for review December 27, 2024 16:55
@mayel
Contributor Author

mayel commented Dec 27, 2024

OK I'm starting to feel pretty good with the generated output (tested with a bunch of projects), probably missing some things but could use some feedback on the implementation and test coverage :)

@doc """
Transform AST into a markdown string.
"""
def to_markdown_string(ast, fun \\ fn _ast, string -> string end)
Member

Why not render the original content instead? 🤔

Contributor Author

We do when the original content is markdown. This was needed for cases where an AST node is created manually, like for type specs.

Member

I see. In this case maybe we should push the functionality to the retriever, so it adds specs both in text and html format.

Member

Or maybe we use a separate function that knows how to render the specs for a given node with the given format.

Contributor Author

@mayel mayel Jan 2, 2025

What about the autolink functions? They depend on the parsed AST of the extra markdown docs and transform it.

Edit: ah I'm not currently using to_markdown_string to use that transformed AST for the guides, but would be good to do so IMO...

Member

Do we want the markdown links in this case? Would the markdown links be useful for man pages? cc @eksperimental

Contributor Author

Dunno. And on the LLM side not sure if links would make a difference, any idea @mjrusso?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @mjrusso that working hyperlinks between all .md files would be nice to have in general. For man pages, I'm not sure they're required, but they might be useful for extracting the links and doing some ad-hoc processing. Man pages often refer to other man pages in the form "name(section)", especially under SEE ALSO at the bottom of a man page.

For the OTP man pages, it's probably more useful to refer to functions using Erlang syntax like keyfind/3 or lists:keyfind/3. But maybe for modules, we can refer to them using the man page syntax like maps(3erl).
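
To make the two reference styles concrete, here is a small sketch of how a post-processing step might format them. The `ManRefSketch` module and its function names are hypothetical, and the default section `"3erl"` is just taken from the `maps(3erl)` example above; module and function names are passed as Erlang-style atoms.

```elixir
defmodule ManRefSketch do
  # Hypothetical helpers, not part of ExDoc or OTP.

  # Module reference in man page syntax, e.g. maps(3erl).
  # The default section is an assumption based on the example above.
  def module_ref(module, section \\ "3erl"), do: "#{module}(#{section})"

  # Remote function reference in Erlang syntax, e.g. lists:keyfind/3.
  def function_ref(module, name, arity), do: "#{module}:#{name}/#{arity}"

  # Local function reference in Erlang syntax, e.g. keyfind/3.
  def local_function_ref(name, arity), do: "#{name}/#{arity}"
end

IO.puts(ManRefSketch.module_ref(:maps))
IO.puts(ManRefSketch.function_ref(:lists, :keyfind, 3))
IO.puts(ManRefSketch.local_function_ref(:keyfind, 3))
```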

@josevalim
Member

Thanks everyone for the work so far. I believe this is a great direction and, at the same time, it shows we need to do some cleanup before moving forward:

  1. We will want to refactor some of the template handling because there is too much shared logic between HTML, EPUB, Markdown. I think some of it should be moved to the retriever.

  2. I believe it should indeed be multiple files and no zips, since we can also use it to generate man pages. See WIP: Markdown formatter #1992, which is now a duplicate? /cc @eksperimental

@mayel
Contributor Author

mayel commented Jan 2, 2025

> 1. We will want to refactor some of the template handling because there is too much shared logic between HTML, EPUB, Markdown. I think some of it should be moved to the retriever.

Is this about DRY and not having duplicate logic (which I tried to address by introducing the ExDoc.Formatter module), or about having more separation between the code for each format? Some guidance would be helpful here.

> 2. I believe it should indeed be multiple files and no zips, since we can also use it to generate man pages.

Yeah it's generating separate files now, following the same structure/naming as the html ones.

> See WIP: Markdown formatter #1992, which is now a duplicate? /cc @eksperimental

Ah yeah seems so, I can look at that PR to see if there's any approach or piece of code (thinking especially of the templates) that looks better and port them to this one?

@eksperimental
Contributor

Hi everyone. I would like to discuss this in more detail. I think we could open an issue so we don't divert the conversation from this PR? While writing the formatter I noticed the duplication but also the limitations of the current approach.

@garazdawi
Contributor

I created a gist with the markdown docs for the Erlang stdlib. I think the results look good, but as mentioned in other comments it could be nice to have links working.

I also think that specs/types/callbacks should be inside ```erlang/elixir blocks. That means that we need to remove the links, but as links are not working anyway it does not matter.

I also noticed that the equiv metadata is not rendered for Erlang docs.

I made a quick attempt at fixing markdown_to_man.escript and the generated output looks nice enough:
[image]

though one can probably spend an infinite amount of time fixing the many many small formatting issues that pop up in various places.

@mayel
Contributor Author

mayel commented Jan 18, 2025

> I also think that specs/types/callbacks should be inside ```erlang/elixir blocks. That means that we need to remove the links, but as links are not working anyway it does not matter.

how are links not working? and yeah it's either having links or formatting there, not sure which is preferable...

@mayel
Contributor Author

mayel commented Jan 18, 2025

> Hi everyone. I would like to discuss this in more detail. I think we could open an issue so we don't divert the conversation from this PR? While writing the formatter I noticed the duplication but also the limitations of the current approach.

were you going to open an issue @eksperimental? otherwise not sure how to proceed here @josevalim?

@josevalim
Member

I should also say that we are adding search over HexDocs, which would allow you to search only certain packages for a given term, and submit the filtered results to an LLM. Would that be better than giving the whole docs of a bunch of deps? Which I assume would consume too many tokens?

@mjrusso

mjrusso commented Jan 18, 2025

> I should also say that we are adding search over HexDocs, which would allow you to search only certain packages for a given term, and submit the filtered results to an LLM. Would that be better than giving the whole docs of a bunch of deps? Which I assume would consume too many tokens?

Yes, there is downstream work required to effectively use the documentation (at least for LLM consumption), which also happens to look a lot like a search problem.

This Livebook is a simple prototype; there are tons of opportunities for improvement, but it does work and provides reasonable results. (This happens to use hex2txt to get the docs, but the nice part is that the approach is general and could work with any Markdown as input. I want a standalone app like this that I can run locally and that is exposed as a Model Context Protocol server, but that's getting off topic :)

@garazdawi
Contributor

> how are links not working?

The links in specs are working, but the autolinks in the markdown documentation are not (that is, t:String.t/0 is not resolved to anything).

> yeah it's either having links or formatting there, not sure which is preferable...

I'm going to guess that this depends on what it will be used for. For the use case that @zuiderkwast wants (that is, converting to man pages), the formatting is preferable, as there are no links in man pages anyway. Either way, it is easy enough for a post-processing tool to strip links and re-format the specs.
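
As a minimal sketch of the link-stripping half of such a post-processing tool, assuming the specs use inline Markdown links of the form `[text](target)`: the `LinkStripSketch` module is hypothetical, and the regex deliberately ignores reference-style links, images, and nested brackets.

```elixir
defmodule LinkStripSketch do
  # Hypothetical helper, not part of ExDoc.
  # Replaces every inline Markdown link [text](target) with just its text.
  # Simplification: reference-style links, images, and nested brackets
  # are not handled.
  def strip_links(markdown) do
    Regex.replace(~r/\[([^\]]*)\]\([^)]*\)/, markdown, "\\1")
  end
end

# Example with a made-up spec line containing one link.
IO.puts(LinkStripSketch.strip_links("@spec get([t()](#t:t/0)) :: term()"))
```

Once the links are gone, the remaining text can be wrapped in an erlang or elixir code block as suggested earlier in the thread.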

6 participants